From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Abstract

Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. First, we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token-level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token-level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically, we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications, and end-to-end training of multi-model systems.
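
To make the credit-assignment claim concrete, here is a minimal sketch of how a per-token implicit reward might be read off a DPO policy. It assumes the standard DPO implicit reward, beta * log(pi / pi_ref), applied token by token. The abstract does not state this formula, and the function names, the beta value, and the log-probabilities below are hypothetical illustrations rather than the authors' implementation.

```python
from typing import List


def token_implicit_rewards(
    policy_logps: List[float],
    ref_logps: List[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-token implicit reward: beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)).

    Assumption: both lists hold log-probabilities of the same sampled tokens,
    one entry per token, under the DPO policy and the reference policy.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference log-prob lists must align")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


def sequence_implicit_reward(
    policy_logps: List[float],
    ref_logps: List[float],
    beta: float = 0.1,
) -> float:
    """Response-level (bandit view) reward: the sum of per-token rewards."""
    return sum(token_implicit_rewards(policy_logps, ref_logps, beta))


# Hypothetical log-probs for a three-token response.
policy = [-1.0, -0.5, -2.0]
reference = [-1.5, -0.5, -1.0]

per_token = token_implicit_rewards(policy, reference, beta=0.1)
total = sequence_implicit_reward(policy, reference, beta=0.1)
print(per_token, total)
```

Under this reading, tokens that the policy up-weights relative to the reference receive positive reward and down-weighted tokens receive negative reward, while the sum recovers the single response-level quantity used in the bandit view.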
